INN Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Canceling is often made easier by the option to do so free of charge or at a low cost. This benefits hotel guests, but it is less desirable for hotels and can reduce their revenue. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. The group is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can identify in advance which bookings are going to be canceled, and help formulate profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

Observation : No duplicate values found

Observation :

Observation : No null/Missing values found
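The duplicate and missing-value checks reported above can be reproduced with pandas. The sketch below uses a tiny made-up frame in place of the real bookings data, so the frame `df` and its values are assumptions; only the column names come from the data set.

```python
import pandas as pd

# Toy stand-in for the bookings data (values are illustrative only).
df = pd.DataFrame(
    {
        "no_of_adults": [2, 1, 2],
        "lead_time": [224, 5, 1],
        "booking_status": ["Not_Canceled", "Not_Canceled", "Canceled"],
    }
)

# Count rows that are exact copies of an earlier row.
n_duplicates = int(df.duplicated().sum())

# Count missing values in each column.
n_missing = df.isnull().sum()

print("duplicate rows:", n_duplicates)
print(n_missing)
```

A result of zero for both checks is what the two observations above describe.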

Observation :

Exploratory Data Analysis (EDA)

Univariate Analysis using both histogram_box distribution plots and labeled_barplot percentage plots

no_of_adults

Observation :

no_of_children

Observation:

no_of_weekend_nights

Observation:

no_of_week_nights

Observation:

required_car_parking_space

Observation :

lead_time

Observation :

arrival_year

Observation :

arrival_month

Observation :

arrival_date

Observation :

repeated_guest

Observation :

no_of_previous_cancellations

Observations :

no_of_previous_bookings_not_canceled

Observations :

avg_price_per_room

Observation:

no_of_special_requests

Observation :

type_of_meal_plan

Observation:

room_type_reserved

Observation:

market_segment_type

Observation:

booking_status

Observation:

Bivariate Analysis against the target variable "booking_status"

Type of plots used

* Stacked Bar plot

* Distribution plot

* Conducting bivariate analysis of each independent variable against the target variable to understand how each one individually affects the target.

'no_of_adults' vs 'booking_status'

Observation :

'no_of_children' vs 'booking_status'

Observation:

'no_of_weekend_nights' vs 'booking_status'

Observation :

'no_of_week_nights' vs 'booking_status'

Observation:

'type_of_meal_plan' vs 'booking_status'

Observation :

'required_car_parking_space' vs 'booking_status'

Observation :

'room_type_reserved' vs 'booking_status'

Observation :

'lead_time' vs 'booking_status'

Observation:

'arrival_year' vs 'booking_status'

Observation:

'arrival_month' vs 'booking_status'

Observation:

'arrival_date' vs 'booking_status'

Observation:

'market_segment_type' vs 'booking_status'

Observation:

'repeated_guest' vs 'booking_status'

Observation:

'no_of_previous_cancellations' vs 'booking_status'

Observation:

Observation:

'no_of_special_requests' vs 'booking_status'

Observation :

'avg_price_per_room' vs 'booking_status'

Observation:

Leading Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
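Most of the leading questions above reduce to counts and group-wise rates. The sketch below shows one way to compute them with pandas on a toy frame. The values, including the "Canceled" and "Not_Canceled" labels, are assumptions for illustration and are not results from the real data.

```python
import pandas as pd

# Toy stand-in for the bookings data; all values are illustrative.
df = pd.DataFrame(
    {
        "arrival_month": [10, 10, 9, 10, 8, 9],
        "market_segment_type": [
            "Online", "Online", "Offline", "Online", "Corporate", "Offline",
        ],
        "repeated_guest": [0, 0, 1, 0, 1, 0],
        "booking_status": [
            "Canceled", "Not_Canceled", "Not_Canceled",
            "Canceled", "Not_Canceled", "Not_Canceled",
        ],
    }
)

# Q1: month with the most bookings.
busiest_month = df["arrival_month"].value_counts().idxmax()

# Q2: share of bookings per market segment, in percent.
segment_share = df["market_segment_type"].value_counts(normalize=True) * 100

# Q4: overall cancellation rate, in percent.
is_canceled = df["booking_status"].eq("Canceled")
cancel_rate = is_canceled.mean() * 100

# Q5: cancellation rate among repeating guests only, in percent.
repeat_cancel_rate = is_canceled[df["repeated_guest"] == 1].mean() * 100
```

The same pattern, grouping by `no_of_special_requests` instead of `repeated_guest`, would address question 6.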

Question 1: What are the busiest months in the hotel?

Observation:

October (month 10) is the busiest month.

Question 2: Which market segment do most of the guests come from?

Observation :

Most of the guests (64%) come from the online market segment.

Question 3: Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?

Observation:

Question 4: What percentage of bookings are canceled?

Observation :

32 percent of bookings are canceled.

Question 5: Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?

Observation:

Question 6: Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Observation:

Multivariate Analysis

Performing multivariate analysis to check the distribution of the variables and the correlation between them

Type of plots used

* Histogram plot

* Pairplot

* Heatmap

Check for distribution and Correlation

Histogram

Observation

Pairplot

Observation

Summary of the EDA

Data Description:

Observations from EDA:

Data that requires Preprocessing:

Data Preprocessing

Observation : There are no missing values in the data, so no missing value treatment is needed.

Outlier treatment

The no_of_children column has extreme values of 9 and 10. These values are replaced with 3.
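The replacement described above can be sketched in one line with pandas. The frame below is a toy stand-in, so its values are assumptions; only the column name `no_of_children` comes from the data set.

```python
import pandas as pd

# Toy column standing in for no_of_children (values are illustrative).
df = pd.DataFrame({"no_of_children": [0, 1, 2, 9, 10, 3]})

# Map the two extreme counts to 3 and leave every other value untouched.
df["no_of_children"] = df["no_of_children"].replace({9: 3, 10: 3})

print(df["no_of_children"].value_counts().sort_index())
```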

Observation :

The avg_price_per_room column has a single extreme outlier of 500 euros. It is replaced with the upper whisker value.

Observation : The 500 value has been replaced and no longer appears in the data.
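A minimal sketch of this kind of replacement follows. It assumes the upper whisker means Q3 + 1.5 × IQR, the usual box plot convention, which may differ from the exact definition used here. The prices are toy values; only the 500 entry mimics the outlier.

```python
import pandas as pd

# Toy prices in euros; only the 500 entry mimics the outlier described above.
prices = pd.Series(
    [60.0, 80.0, 100.0, 120.0, 140.0, 500.0], name="avg_price_per_room"
)

# Assumed definition of the upper whisker: Q3 + 1.5 * IQR.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
upper_whisker = q3 + 1.5 * (q3 - q1)

# Keep values below 500 as they are; replace the rest with the whisker.
treated = prices.where(prices < 500, upper_whisker)

print("upper whisker:", upper_whisker)
print(treated.tolist())
```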

Feature engineering

Log Transformation

Observation :

Outlier detection and treatment

Observation:

Observation : Outliers have been treated.

Data Preparation for modeling

EDA after Data Preprocessing

Univariate Analysis

Observation

Observation :

Observation :

Bivariate Analysis

Bivariate analysis of lead time against booking_status after outlier treatment

Observation

Bivariate analysis of avg_price_per_room against booking_status after outlier treatment

Observation

Plotting a heatmap to check the correlation after data preprocessing

Observation

Model evaluation criterion

The model can make wrong predictions in two ways:

  1. Predicting a customer will not cancel their booking but in reality, the customer will cancel their booking.
  2. Predicting a customer will cancel their booking but in reality, the customer will not cancel their booking.

Which case is more important?

How to reduce the losses?

First, let's create functions to calculate the different metrics and the confusion matrix so that we don't have to repeat the same code for each model.
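Helper functions of this kind could look like the sketch below. The function names `to_labels`, `classification_scores`, and `confusion_counts` are hypothetical, and the toy labels and probabilities are made up. The functions take predicted probabilities plus a threshold, so the same code works for a default or a tuned cut-off.

```python
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def to_labels(y_prob, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels at the given threshold."""
    return [int(p > threshold) for p in y_prob]


def classification_scores(y_true, y_prob, threshold=0.5):
    """Return accuracy, recall, precision and F1 as a one-row DataFrame."""
    y_pred = to_labels(y_prob, threshold)
    return pd.DataFrame(
        {
            "Accuracy": [accuracy_score(y_true, y_pred)],
            "Recall": [recall_score(y_true, y_pred)],
            "Precision": [precision_score(y_true, y_pred)],
            "F1": [f1_score(y_true, y_pred)],
        }
    )


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return the 2x2 confusion matrix (rows: actual, columns: predicted)."""
    return confusion_matrix(y_true, to_labels(y_prob, threshold))


# Toy example: 1 = canceled, 0 = not canceled.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
scores = classification_scores(y_true, y_prob)
cm = confusion_counts(y_true, y_prob)
print(scores)
print(cm)
```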

Building a Logistic Regression model

Observations

Checking Multicollinearity

Observation

Dropping high p-value variables

The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.

Observation :

Coefficient interpretations

Observation:

Converting coefficients to odds
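A logistic regression coefficient b is a change in log-odds, so exp(b) is the multiplicative change in the odds for a one-unit increase, and (exp(b) - 1) × 100 is the same effect as a percentage. The sketch below applies this to hypothetical coefficients chosen so the arithmetic is easy to check. The feature names and values are not taken from the fitted model.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients (log-odds), picked for easy arithmetic.
coefs = pd.Series(
    {"feature_a": np.log(2), "feature_b": 0.0, "feature_c": -np.log(2)}
)

# Multiplicative change in the odds for a one-unit increase.
odds = np.exp(coefs)

# The same effect expressed as a percentage change in the odds.
pct_change = (odds - 1) * 100

summary = pd.DataFrame({"Odds": odds, "Change_odds_pct": pct_change})
print(summary)
```

Here `feature_a` doubles the odds (+100%), `feature_b` leaves them unchanged, and `feature_c` halves them (-50%).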

Observations:

Model performance evaluation

Checking model performance on the training set

ROC-AUC

Observation :

Model Performance Improvement

Let's use the Precision-Recall curve to see if we can find a better threshold
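One common way to read a threshold off the Precision-Recall curve is to take the point where precision and recall are closest. The sketch below does this with scikit-learn on synthetic data. The `LogisticRegression` model is only a stand-in for whichever fitted model supplies the probabilities, and the balancing rule is an assumption about how the threshold is picked.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Synthetic, mildly imbalanced data standing in for the bookings data.
X, y = make_classification(
    n_samples=1000, n_features=6, weights=[0.67, 0.33], random_state=1
)
model = LogisticRegression(max_iter=1000).fit(X, y)
y_prob = model.predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, y_prob)

# precision and recall have one more entry than thresholds, so the final
# point (precision=1, recall=0) is dropped before comparing.
gap = np.abs(precision[:-1] - recall[:-1])
balanced_threshold = thresholds[np.argmin(gap)]

print("threshold where precision and recall are closest:", balanced_threshold)
```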

Checking model performance on training set

Let's check the performance on the test set

Using model with default threshold

Using model with threshold=0.37

Final Model Summary

Conclusion

Building a Decision Tree model

Checking the model's confusion matrix on Train and Test

Checking model performance scores on Train and Test

Observation:

Do we need to prune the tree?

Before pruning the tree let's check the important features.

Observation :

Pruning the tree

Using GridSearch for Hyperparameter tuning of our tree model

Pre-Pruning

Checking the model's confusion matrix on Train and Test
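A minimal sketch of pre-pruning with `GridSearchCV` follows. The parameter grid, the F1 scoring choice, and the synthetic data are assumptions for illustration. A real grid would depend on the data and on the metric chosen in the evaluation criterion.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared bookings features.
X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Hypothetical grid: limit depth and leaf size to restrain the tree.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="f1",
    cv=5,
)
grid.fit(X_train, y_train)

# GridSearchCV refits the best combination on the full training set.
best_tree = grid.best_estimator_
print(grid.best_params_)
```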

Checking model performance scores on Train and Test

Observation :

Visualizing the Decision Tree

Observation :

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.
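The call described above can be sketched as follows. Synthetic data stands in for the bookings data, so the data and the random seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared bookings features.
X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

clf = DecisionTreeClassifier(random_state=1)

# Effective alphas and the total leaf impurity at each pruning step.
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

for alpha, impurity in zip(ccp_alphas[:5], impurities[:5]):
    print(f"alpha={alpha:.5f}  total leaf impurity={impurity:.5f}")
```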

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and the tree depth decrease as alpha increases.
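The two steps above, fitting one tree per effective alpha and then dropping the trivial last tree, might look like the sketch below. The synthetic data is an assumption, and the clipping of alphas is a defensive guard added here against tiny negative values from rounding.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
# Guard against tiny negative alphas caused by floating-point rounding.
ccp_alphas = np.clip(path.ccp_alphas, 0, None)

# Fit one tree per effective alpha.
clfs = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X_train, y_train)
    for alpha in ccp_alphas
]
print("nodes in the last tree:", clfs[-1].tree_.node_count)

# Drop the last tree and its alpha before comparing sizes.
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depths = [clf.tree_.max_depth for clf in clfs]
```

Plotting `node_counts` and `depths` against `ccp_alphas` gives the usual pair of step charts.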

F1 Score vs alpha for training and testing sets
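A sketch of the comparison follows: compute F1 on both splits for every pruned tree, then keep the alpha with the best held-out score. The synthetic data is an assumption, and the held-out split here plays the role of a validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
# Drop the last alpha (single-node tree) and guard against rounding noise.
ccp_alphas = np.clip(path.ccp_alphas[:-1], 0, None)

clfs = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X_train, y_train)
    for alpha in ccp_alphas
]

# F1 on each split for every pruned tree.
f1_train = [f1_score(y_train, clf.predict(X_train)) for clf in clfs]
f1_test = [f1_score(y_test, clf.predict(X_test)) for clf in clfs]

# Keep the tree whose held-out F1 is highest.
best_index = int(np.argmax(f1_test))
best_alpha = ccp_alphas[best_index]
best_model = clfs[best_index]
print("best alpha:", best_alpha, "held-out F1:", f1_test[best_index])
```

Training F1 typically falls as alpha grows, while held-out F1 often rises first and then falls, which is what makes an intermediate alpha attractive.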

Model Performance Comparison and Conclusions

Checking the model's confusion matrix on Train and Test

Checking model performance scores on Train and Test

Observation :

Comparing Decision Tree models

Comparing Logistic and Decision Tree model scores on Test data

Observation:

Actionable Insights and Recommendations

Actionable Insights:

Recommendations: